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Abstract 

We study optimization algorithms based on variance reduction for stochastic gra¬ 
dient descent (SGD). Remarkable recent progress has been made in this direc¬ 
tion through development of algorithms like SAG, SVRG, SAGA. These algo¬ 
rithms have been shown to outperform SGD, both theoretically and empirically. 
However, asynchronous versions of these algorithms—a crucial requirement for 
modern large-scale applications—have not been studied. We bridge this gap by 
presenting a unifying framework for many variance reduction techniques. Subse¬ 
quently, we propose an asynchronous algorithm grounded in our framework, and 
prove its fast convergence. An important consequence of our general approach 
is that it yields asynchronous versions of variance reduction algorithms such as 
SVRG and SAGA as a byproduct. Our method achieves near linear speedup in 
sparse settings common to machine learning. We demonstrate the empirical per¬ 
formance of our method through a concrete realization of asynchronous SVRG. 


1 Introduction 


There has been a steep rise in recent work [6, 7, 9-12, 25, 27, 29] on “variance reduced” stochastic 
gradient algorithms for convex problems of the finite-sum form: 


min f(x) 

xeR d 


n ^i =t 




( 1 . 1 ) 


Under strong convexity assumptions, such variance reduced (VR) stochastic algorithms attain better 
convergence rates (in expectation) than stochastic gradient descent (SGD) [18, 24], both in theory 
and practice. 1 The key property of these VR algorithms is that by exploiting problem structure 
and by making suitable space-time tradeoffs, they reduce the variance incurred due to stochastic 
gradients. This variance reduction has powerful consequences: it helps VR stochastic methods 
attain linear convergence rates, and thereby circumvents slowdowns that usually hit SGD. 


’Though we should note that SGD also applies to the harder stochastic optimization problem min F(x) = 
E[/(x; £)], which need not be a finite-sum. 
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Although these advances have great value in general, for large-scale problems we still require par¬ 
allel or distributed processing. And in this setting, asynchronous variants of SGD remain indispens¬ 
able [2, 8, 13, 21, 28, 30]. Therefore, a key question is how to extend the synchronous finite-sum 
VR algorithms to asynchronous parallel and distributed settings. 

We answer one part of this question by developing new asynchronous parallel stochastic gradient 
methods that provably converge at a linear rate for smooth strongly convex finite-sum problems. 
Our methods are inspired by the influential Svrg [10], S2gd [12], Sag [25] and Saga [6] family 
of algorithms. We list our contributions more precisely below. 

Contributions. Our paper makes two core contributions: (i) a formal general framework for vari¬ 
ance reduced stochastic methods based on discussions in [6]; and (ii) asynchronous parallel VR al¬ 
gorithms within this framework. Our general framework presents a formal unifying view of several 
VR methods (e.g., it includes SAGA and SVRG as special cases) while expressing key algorithmic 
and practical tradeoffs concisely. Thus, it yields a broader understanding of VR methods, which 
helps us obtain asynchronous parallel variants of VR methods. Under sparse-data settings common 
to machine learning problems, our parallel algorithms attain speedups that scale near linearly with 
the number of processors. 

As a concrete illustration, we present a specialization to an asynchronous SVRG-like method. We 
compare this specialization with non-variance reduced asynchronous SGD methods, and observe 
strong empirical speedups that agree with the theory. 

Related work. As already mentioned, our work is closest to (and generalizes) Sag [25], SAGA [6], 
SVRG [10] and S2gd [12], which are primal methods. Also closely related are dual methods such 
as SDCA [27] and Finito [7], and in its convex incarnation MlSO [16]; a more precise relation 
between these dual methods and VR stochastic methods is described in Defazio’s thesis [5], By 
their algorithmic structure, these VR methods trace back to classical non-stochastic incremental 
gradient algorithms [4], but by now it is well-recognized that randomization helps obtain much 
sharper convergence results (in expectation). Proximal [29] and accelerated VR methods have also 
been proposed [20, 26]; we leave a study of such variants of our framework as future work. Finally, 
there is recent work on lower-bounds for finite-sum problems [1], 

Within asynchronous SGD algorithms, both parallel [21] and distributed [2, 17] variants are known. 
In this paper, we focus our attention on the parallel setting. A different line of methods is that 
of (primal) coordinate descent methods, and their parallel and distributed variants [14, 15, 19, 22, 
23]. Our asynchronous methods share some structural assumptions with these methods. Finally, 
the recent work [11] generalizes S2GD to the mini-batch setting, thereby also permitting parallel 
processing, albeit with more synchronization and allowing only small mini-batches. 

2 A General Framework for VR Stochastic Methods 

We focus on instances of (1.1) where the cost function f{x) has an L-Lipschitz gradient, so that 
|jV/(x) — V/(y)|| < L\\x — y\\, and it is A-strongly convex, i.e., for all x, y £ R d , 

/O) > f(y) + (Vf(y),x-y) + |||a: - yf. (2.1) 

While our analysis focuses on strongly convex functions, we can extend it to just smooth convex 
functions along the lines of [6, 29], 

Inspired by the discussion on a general view of variance reduced techniques in [6], we now describe 
a formal general framework for variance reduction in stochastic gradient descent. We denote the 
collection {/,;}" =1 of functions that make up / in (1.1) by T. For our algorithm, we maintain an 
additional parameter a\ £ for each /,; £ T. We use A 4 to denote {a*}” =1 . The general iterative 
framework for updating the parameters is presented as Algorithm 1 . Observe that the algorithm is 
still abstract, since it does not specify the subroutine ScheduleUpdate. This subroutine deter¬ 
mines the crucial update mechanism of {a*} (and thereby of A f ). As we will see different schedules 
give rise to different fast first-order methods proposed in the literature. The part of the update based 
on A t is the key for these approaches and is responsible for variance reduction. 

Next, we provide different instantiations of the framework and construct a new algorithm derived 
from it. In particular, we consider incremental methods SAG [25], Svrg [10] and SAGA [6], and 
classic gradient descent GradientDescent for demonstrating our framework. 
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ALGORITHM 1: Generic Stochastic Variance Reduction Algorithm 
Data: x° £ a° = x° Vi € [n] = {1,..., n}, step size 77 > 0 
Randomly pick a It = {io, ■ ■ • At} where i t £ {1,... ,n} V t £ {0, ...,T} ; 
for t = 0 to T do 

Update iterate as x t+1 £- x t - 77 (V /^(x*) - V/ it (a| t ) + \ E* v /i( a D) ! 
A t+1 = SCHEDULEUPDATE({a; i }‘+o, A t ,t,I T ) ; 

end 

return x T 


Figure 1 shows the schedules for the aforementioned algorithms. In case of SVRG, 
ScheduleUpdate is triggered every m iterations (here to denotes precisely the number of in¬ 
ner iterations used in [10]); so A t remains unchanged for the to iterations and all a\ are updated 
to the current iterate at the to* iteration. For Saga, unlike SVRG, A 4 changes at the t th iteration 
for all t £ \T\. This change is only to a single element of A 4 , and is determined by the index i t 
(the function chosen at iteration t). The update of SAG is similar to SAGA insofar that only one of 
the on is updated at each iteration. However, the update for A t+1 is based on i t + \ rather than i f . 
This results in a biased estimate of the gradient, unlike SVRG and SAGA. Finally, the schedule for 
gradient descent is similar to Sag, except that all the ctj’s are updated at each iteration. Due to the 
full update we end up with the exact gradient at each iteration. This discussion highlights how the 
scheduler determines the resulting gradient method. 

To motivate the design of another schedule, let us consider the computational and storage costs of 
each of these algorithms. For SVRG, since we update A 4 after every to iterations, it is enough to 
store a full gradient, and hence, the storage cost is O(d). However, the running time is O(d) at 
each iteration and 0(nd) at the end of each epoch (for calculating the full gradient at the end of 
each epoch). In contrast, both SAG and Saga have high storage costs of O(nd) and running time 
of 0(d) per iteration. Finally, GRADIENTDESCENT has low storage cost since it needs to store the 
gradient at 0(d) cost, but very high computational costs of 0(nd) at each iteration. 

SVRG has an additional computation overhead at the end of each epoch due to calculation of the 
whole gradient. This is avoided in Sag and Saga at the cost of additional storage. When to 
is very large, the additional computational overhead of SVRG amortized over all the iterations is 
small. However, as we will later see, this comes at the expense of slower convergence to the optimal 
solution. The tradeoffs between the epoch size m, additional storage, frequency of updates, and the 
convergence to the optimal solution are still not completely resolved. 


SVRG:ScheduleUpdate^x*},■+;!), A\ t, I T ) 


SAGA:ScheduleUpdate({.t*}‘+o, A\ t, I T ) 

for i = 1 to n do 


for i = 1 to n do 

ct 4+1 = 1 (to + 1 (mj(t)al ; 


a- +1 =l(i t = ijx 1 + 1 (z t ^ i)a\ ; 

end 


end 

return A t+1 


return A t+1 



SAG:SCHEDULEUPDATE({a; i }-j:o, A 4 , t, I T ) 


GD:SCHEDULEUPDATE({cU}‘io, A 4 , t, I T ) 

for i = 1 to n do 


for i = 1 to n do 

a* +1 = 1 (i t+1 = i) x t+1 + ± i)a\ ; 


| a * +1 = x t+1 ; 

end 


end 

return A t+1 


return A t+1 


Figure 1: ScheduleUpdate function for Svrg (top left). Saga (top right). Sag (bottom left) 
and GradientDescent (bottom right). While Svrg is epoch-based, rest of algorithms perform 
updates at each iteration. Here a\b denotes that a divides b. 

A straightforward approach to design a new scheduler is to combine the schedules of the above al¬ 
gorithms. This allows us to tradeoff between the various aforementioned parameters of our interest. 
We call this schedule hybrid stochastic average gradient (Hsag). Here, we use the schedules of 
Svrg and Saga to develop Hsag. However, in general, schedules of any of these algorithms can 
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Hsag:ScheduleUpdate(x \A\t,I T ) 

for i = 1 to n do 

n t+ i_f + 

i \ l(s» | t)x* + l(s i/ ft)a- 

end 

return A t+l 


if i £ S 
if i $. S 


Figure 2: ScheduleUpdate for Hsag. This algorithm assumes access to some index set S and 
the schedule frequency vector s. Recall that a\b denotes a divides b 


be combined to obtain a hybrid algorithm. Consider some S C [n], the indices that follow SAGA 
schedule. We assume that the rest of the indices follow an SVRG-like schedule with schedule fre¬ 
quency Si for all* £ S = [n] \ S. Figure 2 shows the corresponding update schedule of Hsag. If 
S = [n] then HSAG is equivalent to Saga, while at the other extreme, for 5 = 0 and s t = m for all 
i £ [n], it corresponds to Svrg. HSAG exhibits interesting storage, computational and convergence 
trade-offs that depend on S. In general, while large cardinality of S likely incurs high storage costs, 
the computational cost per iteration is relatively low. On the other hand, when cardinality of S is 
small and sf s are large, storage costs are low but the convergence typically slows down. 

Before concluding our discussion on the general framework, we would like to draw the reader’s 
attention to the advantages of studying Algorithm 1 . First, note that Algorithm 1 provides a unifying 
framework for many incremental/stochastic gradient methods proposed in the literature. Second, and 
more importantly, it provides a generic platform for analyzing this class of algorithms. As we will 
see in Section 3, this helps us develop and analyze asynchronous versions for different finite-sum 
algorithms under a common umbrella. Finally, it provides a mechanism to derive new algorithms by 
designing more sophisticated schedules; as noted above, one such construction gives rise to HSAG. 


2.1 Convergence Analysis 


In this section, we provide convergence analysis for Algorithm 1 with HSAG schedules. As observed 
earlier, Svrg and Saga are special cases of this setup. Our analysis assumes unbiasedness of the 
gradient estimates at each iteration, so it does not encompass Sag. For ease of exposition, we 
assume that all Sj = m for all i £ [n]. Since Hsag is epoch-based, our analysis focuses on the 
iterates obtained after each epoch. Similar to [10] (see Option II of Svrg in [10]), our analysis will 
be for the case where the iterate at the end of (k + l) st epoch, x krn+m , is replaced with an element 
chosen randomly from {x km , ..., x km+m ~ 1 } with probability {pi, ■ ■ ■ ,p m }- For brevity, we use 
x k to denote the iterate chosen at the k' h epoch. We also need the following quantity for our analysis; 

G fc ^ 1 X] (/,:(af m ) - fi(x*) - (V/.Or*), a km - x *)) . 
n ies 


Theorem 1. For any positive parameters c, /?, k > 1, step size q and epoch size m, we define the 
following quantities: 


7 = K 




2cq{\ - Li 7(1 + /?)) 


1 

n 


2c 

n\ 


6 = max 



2 Lcq 2 
7 





Suppose the probabilities pi oc (1 — F) m and that c, /?, k, step size q and epoch size m are chosen 
such that the following conditions are satisfied: 


1 9 

—b 2 Lcq 2 

K 



< —, 7 > 0, 9 < 1. 

n 


Then, for iterates of Algorithm 1 under the HSAG schedule, we have 


E 


/(i fe+i ) - f(x*) + -Gk+i 


< 6»E 


/(**) - /(**) + ~G k 
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As a corollary, we immediately obtain an expected linear rate of convergence for Hsag. 

Corollary 1. Note that Gk > 0 and therefore, under the conditions specified in Theorem 1 and with 
9 = 9 (1 + I/7) < 1 we have 

e [/(i fc ) - /(**)] < e k [f(x°) - /(**)]. 


We emphasize that there exist values of the parameters for which the conditions in Theorem 1 
and Corollary 1 are easily satisfied. For instance, setting 7 = l/16(An + L), n = 4/Ary, 
/3 = (2A n + L)/L and c = 2/rjn, the conditions in Theorem 1 are satisfied for sufficiently large 
to. Additionally, in the high condition number regime of L/X = n, we can obtain constant 9 < 1 
(say 0.5) with to = 0(n) epoch size (similar to [ 6 , 10]). This leads to a computational com¬ 
plexity of 0(n log(l/e)) for Hsag to achieve e accuracy in the objective function as opposed to 
0(n 2 log(l/e)) for batch gradient descent method. Please refer to the appendix for more details on 
the parameters in Theorem 1 . 

3 Asynchronous Stochastic Variance Reduction 

We are now ready to present asynchronous versions of the algorithms captured by our general frame¬ 
work. We first describe our setup before delving into the details of these algorithms. Our model of 
computation is similar to the ones used in Hogwild! [21] and AsySCD [14]. We assume a multicore 
architecture where each core makes stochastic gradient updates to a centrally stored vector x in an 
asynchronous manner. There are four key components in our asynchronous algorithm; these are 
briefly described below. 

1. Read: Read the iterate x and compute the gradient V/,, (x) for a randomly chosen i t . 

2. Read schedule iterate: Read the schedule iterate A and compute the gradients required 
for update in Algorithm 1. 

3. Update: Update the iterate x with the computed incremental update in Algorithm 1. 

4. Schedule Update: Run a scheduler update for updating A. 

Each processor repeatedly runs these procedures concurrently, without any synchronization. Hence, 
x may change in between Step 1 and Step 3. Similarly, A may change in between Steps 2 and 4. In 
fact, the states of iterates x and A can correspond to different time-stamps. We maintain a global 
counter t, to track the number of updates successfully executed. We use D(t ) £ [t] and D'(t) £ [f] 
to denote the particular x-iterate and ,1-iterate used for evaluating the update at the I th iteration. We 
assume that the delay in between the time of evaluation and updating is bounded by a non-negative 
integer r, i.e., t — D(t) < r and t — D'{t) < r. The bound on the staleness captures the degree of 
parallelism in the method: such parameters are typical in asynchronous systems (see e.g., [3, 14]). 
Furthermore, we also assume that the system is synchronized after every epoch i.e., D(t) > km for 
t > km. We would like to emphasize that the assumption is not strong since such a synchronization 
needs to be done only once per epoch. 

For the purpose of our analysis, we assume a consistent read model. In particular, our analysis 
assumes that the vector x used for evaluation of gradients is a valid iterate that existed at some point 
in time. Such an assumption typically amounts to using locks in practice. This problem can be 
avoided by using random coordinate updates as in [21] (see Section 4 of [21]) but such a procedure 
is computationally wasteful in practice. We leave the analysis of inconsistent read model as future 
work. Nonetheless, we report results for both locked and lock-free implementations (see Section 4). 

3.1 Convergence Analysis 

The key ingredients to the success of asynchronous algorithms for multicore stochastic gradient 
descent are sparsity and “disjointness” of the data matrix [21]. More formally, suppose fi only 
depends on x ei where e, C \rl] i.e., fi acts only on the components of x indexed by the set e,;. Let 
||x||f denote \\ x j II > then, the convergence depends on A, the smallest constant such that 

Ej[||a;||i] < A||x|| 2 . Intuitively, A denotes the average frequency with which a feature appears in 
the data matrix. We are interested in situations where A <C 1. As a warm up, let us first discuss 
convergence analysis for asynchronous Svrg. The general case is similar, but much more involved. 
Hence, it is instructive to first go through the analysis of asynchronous Svrg. 
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Theorem 2. Suppose step size p, epoch size m are chosen such that the following condition holds: 


0 < 9„ := - 


1 

A r)m 


4 L 


(l-4 L ( 


f _n±LArfnf_\\ 
y 1 —2L 2 Ar/ 2 r 2 J ) 

tj+LAt 2 ^ 2 \ \ 

1 — 2L 2 Atp t 2 J ) 


< 1. 


Then, for the iterates of an asynchronous variant of Algorithm 1 with SVRG schedule and probabil¬ 
ities pi = 1/m for all i € [m], we have 

E if(x k+1 ) - fix*)} < e s E[f(x k ) - /(**)]. 


The bound obtained in Theorem 2 is useful when A is small. To see this, as earlier, consider the 
indicative case where L/A = n. The synchronous version of SVRG obtains a convergence rate of 
9 = 0.5 for step size p = 0.1/L and epoch size m = 0(n). For the asynchronous variant of SVRG, 
by setting p = 0.1/2(max{l, A 1 / 2 r}L), we obtain a similar rate with m = 0(n + A 1 / 2 rn). 
To obtain this, set p = p/L where p = 0.1/2(max{l, A 1 / 2 r}) and 9 S = 0.5. Then, a simple 
calculation gives the following: 


m 

n 


2 / 1 - 2A r 2 p 2 \ 
p 1,1-12 p- 14Ar 2 p 2 ) 


< d max{l, A 1,/2 r}, 


where d is some constant. This follows from the fact that p = 0.1/2(max{l, A 1,/2 t}). Suppose 
t < 1/A 1 / 2 . Then we can achieve nearly the same guarantees as the synchronous version, but r 
times faster since we are running the algorithm asynchronously. For example, consider the sparse 
setting where A = o(l/n); then it is possible to get near linear speedup when r = o(n 1 / 2 ). On the 
other hand, when A 1 / 2 r > 1, we can obtain a theoretical speedup of 1/A 1 / 2 . 


We finally provide the convergence result for the asynchronous algorithm in the general case. The 
proof is complicated by the fact that set A, unlike in SVRG, changes during the epoch. The key idea 
is that only a single element of A changes at each iteration. Furthermore, it can only change to one 
of the iterates in the epoch. This control provides a handle on the error obtained due to the staleness. 
Due to space constraints, the proof is relegated to the appendix. 

Theorem 3. For any positive parameters c, /3, k > 1, step size p and epoch size m, we define the 
following quantities: 


C = ( cp 2 + ( 1 - ^ ) cLAr 2 p 3 ) , 


7 a = K 


1—11 - 

AC 


= max ■ 


2c 

7a A 


1 - 


8 CL (l + A) 


2c 

96/Xr 

nA 

n 

L 

l - 

f 1 - 1 



V K 


1 - - 
K 


1 - 


Suppose probabilities pi oc (1 — j:) m *, parameters /3, k, step-size p, and epoch size m are chosen 
such that the following conditions are satisfied: 


-+8C L 

hi 



+ 96C Lt 
n 




1 

12L 2 At 2 ’ 


7a > o, e a < 1. 


Then, for the iterates of asynchronous variant of Algorithm 1 with HSAG schedule we have 


E 

f(x k+1 ) - f(x*) + —Gk+i 

< 9 a E 

f(x k ) - f(x*) + —Gk 


L 7 a J 


L 7 a J 


Corollary 2. Note that Gk > 0 and therefore, under the conditions specified in Theorem 3 and with 
0a = 9 a (1 + l/7a) < 1, we have 

E [ f(x k ) - f(x *)] < e k [f(x°) /CO]. 
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Figure 3: Z 2 -regularized logistic regression. Speedup curves for Lock-Free Svrg and Locked SVRG 
on rcvl (left), real-sim (left center), news20 (right center) and url (right) datasets. We report the 
speedup achieved by increasing the number of threads. 


By using step size normalized by A 1 / 2 r (similar to Theorem 2) and parameters similar to the ones 
specified after Theorem 1 we can show speedups similar to the ones obtained in Theorem 2. Please 
refer to the appendix for more details on the parameters in Theorem 3. 

Before ending our discussion on the theoretical analysis, we would like to highlight an important 
point. Our emphasis throughout the paper was on generality. While the results are presented here in 
full generality, one can obtain stronger results in specific cases. For example, in the case of Saga, 
one can obtain per iteration convergence guarantees (see [6]) rather than those corresponding to per 
epoch presented in the paper. Also, SAGA can be analyzed without any additional synchronization 
per epoch. However, there is no qualitative difference in these guarantees accumulated over the 
epoch. Furthermore, in this case, our analysis for both synchronous and asynchronous cases can be 
easily modified to obtain convergence properties similar to those in [6], 


4 Experiments 

We present our empirical results in this section. For our experiments, we study the problem of 
binary classification via Z 2 -regularized logistic regression. More formally, we are interested in the 
following optimization problem: 

1 ” 

min - ^2 (log(l + exp {yizjx)) + A||x|| 2 ) , (4.1) 

x Tl 

i =1 

where Zi £ W l and yi is the corresponding label for each i £ [n\. In all our experiments, we set 
A = 1/n. Note that such a choice leads to high condition number. 

A careful implementation of Svrg is required for sparse gradients since the implementation as 
stated in Algorithm 1 will lead to dense updates at each iteration. For an efficient implementation, a 
scheme like the ‘just-in-time’ update scheme, as suggested in [25], is required. Due to lack of space, 
we provide the implementation details in the appendix. 

We evaluate the following algorithms for our experiments: 

• Lock-Free SVRG: This is the lock-free asynchronous variant of Algorithm 1 using Svrg sched¬ 
ule; all threads can read and update the parameters with any synchronization. Parameter updates 
are performed through atomic compare-and-swap instruction [21], A constant step size that 
gives the best convergence is chosen for the dataset. 

• Locked SVRG: This is the locked version of the asynchronous variant of Algorithm 1 using 
Svrg schedule. In particular, we use a concurrent read exclusive write locking model, where all 
threads can read the parameters but only one threads can update the parameters at a given time. 
The step size is chosen similar to Lock-Free SVRG. 

• Lock-Free SGD: This is the lock-free asynchronous variant of the SGD algorithm (see [21]). 
We compare two different versions of this algorithm: (i) Sgd with constant step size (referred 
to as CSgd). (ii) SGD with decaying step size 770 \Jcro/(t + op) (referred to as DSGD), where 
constants 770 and <ro specify the scale and speed of decay. For each of these versions, step size is 
tuned for each dataset to give the best convergence progress. 
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Figure 4: / 2 -regularized logistic regression. Training loss residual fix) — f(x*) versus time plot of 
Lock-Free SVRG, DSgd and CSGD on rcvl (left), real-sim (left center), news20 (right center) and 
url (right) datasets. The experiments are parallelized over 10 cores. 


All the algorithms were implemented in C++ 2 . We run our experiments on datasets from LIBSVM 
website 3 * . Similar to [29], we normalize each example in the dataset so that 11^11 2 = 1 for all 
i G [n]. Such a normalization leads to an upper bound of 0.25 on the Lipschitz constant of the 
gradient of _/). The epoch size m is chosen as 2 n (as recommended in [10]) in all our experiments. 
In the first experiment, we compare the speedup achieved by our asynchronous algorithm. To this 
end, for each dataset we first measure the time required for the algorithm to each an accuracy of 
10 -10 (i.e., f(x) — fix*) < 10~ 10 ). The speedup with P threads is defined as the ratio of the 
runtime with a single thread to the runtime with P threads. Results in Figure 3 show the speedup 
on various datasets. As seen in the figure, we achieve significant speedups for all the datasets. 
Not surprisingly, the speedup achieved by Lock-free SVRG is much higher than ones obtained by 
locking. Furthermore, the lowest speedup is achieved for rcvl dataset. Similar speedup behavior 
was reported for this dataset in [21]. It should be noted that this dataset is not sparse and hence, is a 
bad case for the algorithm (similar to [ 21 ]). 

For the second set of experiments we compare the performance of Lock-Free SVRG with stochastic 
gradient descent. In particular, we compare with the variants of stochastic gradient descent, DSgd 
and CSGD, described earlier in this section. It is well established that the performance of variance 
reduced stochastic methods is better than that of Sgd. We would like to empirically verify that such 
benefits carry over to the asynchronous variants of these algorithms. Figure 4 shows the performance 
of Lock-Free SVRG, DSgd and CSGD. Since the computation complexity of each epoch of these 
algorithms is different, we directly plot the objective value versus the runtime for each of these 
algorithms. We use 10 cores for comparing the algorithms in this experiment. As seen in the figure, 
Lock-Free SVRG outperforms both DSgd and CSGD. The performance gains are qualitatively 
similar to those reported in [10] for the synchronous versions of these algorithms. It can also be 
seen that the DSgd, not surprisingly, outperforms CSGD in all the cases. In our experiments, we 
observed that Lock-Free SVRG, in comparison to Sgd, is relatively much less sensitive to the step 
size and more robust to increasing threads. 

5 Discussion & Future Work 

In this paper, we presented a unifying framework based on [ 6 ], that captures many popular variance 
reduction techniques for stochastic gradient descent. We use this framework to develop a simple 
hybrid variance reduction method. The primary purpose of the framework, however, was to provide 
a common platform to analyze various variance reduction techniques. To this end, we provided 
convergence analysis for the framework under certain conditions. More importantly, we propose an 
asynchronous algorithm for the framework with provable convergence guarantees. The key conse¬ 
quence of our approach is that we obtain asynchronous variants of several algorithms like SVRG, 
SAGA and S2 gd. Our asynchronous algorithms exploits sparsity in the data to obtain near linear 
speedup in settings that are typically encountered in machine learning. 

For future work, it would be interesting to perform an empirical comparison of various schedules. 
In particular, it would be worth exploring the space-time-accuracy tradeoffs of these schedules. We 
would also like to analyze the effect of these tradeoffs on the asynchronous variants. 

2 All experiments were conducted on a Google Compute Engine nl-highcpu-32 machine with 32 processors 

and 28.8 GB RAM. 

3 http://www.csie.ntu.edu,tw/~cjlin/libsvmtools/datasets/binary.html 
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A Appendix 

Notation: We use Df to denote the Bregman divergence (defined below) for function /. 

Df(x, y ) = f(x) - f(y) - (V/(y), x-y). 

For ease of exposition, we use E[A'] to denote the expectation the random variable X with respect 
to indices {i i ,... ,i t } when X depends on just these indices up to step t. This dependence will be 
clear from the context. We use 1 to denote the indicator function. We assume Yd=i a d = 0 if i > j. 

We would like to clarify the definition of x km here. As noted in the main text, we assume that 
x km+m j s re pi acec j with an element chosen randomly from {x km , ..., x km+m ~ 1 } with probability 
{pi, ■ • • ,p m } at the end of the (k + l) th epoch. However, whenever x km appears in the analysis 
(proofs), it represents the iterate before this replacement. 


Implementation Details 

Since we are interested in sparse datasets, simply taking fi(x) = log(l+exp(?/iz7cc))+A||at|| 2 is not 
efficient as it requires updating the whole vector x at each iteration. This is due to the regularization 
term in each of the /)’s. Instead, similar to [21], we rewrite problem in (4.1) as follows: 

min - E ( + e MVizJ x )) + x E I ’ 

X Tl \ CL q I 

i=l \ j£nz(zi) J ) 

where nz(z) represents the non-zero components of vector z, and dj = Y,. 1 (j £ nz(zi)). While 
this leads to sparse gradients at each iteration, updates in Svrg are still dense due to the part of 
the update that contains Y, V/jja, ) /n. This problem can be circumvented by using the following 
update scheme. First, recall that for SVRG, Y, ^fi( a i)/ n does not change during an epoch (see 
Figure 1). Therefore, during the [k + l) st epoch we have the following relationship: 


t -1 

x k ~r) E (V/< J -(® 3 ')-V/ ii (i fc )) 

— 


j—km 


2—1 


We maintain each bracketed term separately. The updates to the first term in the above equation are 
sparse while those to the second term are just a simple scalar addition, since we already maintain 
the average gradient Y^i V fi(x k )/n. When the gradient of f lf at x f is needed, we only calculate 
components of x l required for /,, on the fly by aggregating these two terms. Hence, each step of 
this update procedure can be implemented in a way that respects sparsity of the data. 


Proof of Theorem 1 

Proof. We expand function / as f(x) = g(x) + h{x) where g{x) = 4 Yi^s fi( x ) an d h(x) = 
n Yi<ts Y ( x )• Let the present epoch be k + 1. We define the following: 

v/ it (xVv/i>U + ^E v /tK‘) 

i 

Gt = 1 E - /iOO - <V/iOO,o{ - X*}) 
ies 

R t = E\c\\x t -x*\\ 2 +Gi] . 

We first observe that E[r> 4 ] = — V/(x 4 ). This follows from the unbiasedness of the gradient at each 
iteration. Using this observation, we have the following: 

E[Rt+i] = E[c||a: t+1 - s*|| 2 + G t+ i\ = EjcH^ + gY - x*|| 2 + G t+1 ] 

= cE [Hz* - a:*|| 2 ] + cq 2 E [||i' t || 2 ] + 2cgE [(at* - x*,!’*)] + E[G t+ i] 

< cE [Har* - x*|| 2 ] + crfE [||v t || 2 ] - 2c?;E [/(**) - f(x*)} + E[G t+1 ]. (A.l) 


v* = -{x t+1 -x t ) = - 
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The last step follows from convexity of / and the unbiasedness of v*. We have the following rela¬ 
tionship between Gt+i and Gt. 


E[G t+1 ] = 1 - - E [G t ] + -E 


(/»(**) - fi{x*) - (Vfi(x*),x f - x*)) 


n „ 
L i&s 


= ( 1-I E [G t ] H— E[D g {x t ,x*)]. 


(A.2) 


This follows from the definition of the schedule of Hsag for indices in S. Substituting the above 
relationship in Equation (A.l) we get the following. 

Rt+i < Rt+crj 2 E [Ml 2 ] - 2cr,E [/(**) - /(s*)] - ^E[G t ] + ^-E[D g {x t , a:*)] 

< f 1 - Rt + -EIM - a;*|| 2 ] +c?7 2 E [||v*|| 2 ] - 2c?yE [/(a; 4 ) - f(x*)] 

\ hi J n 

+ f- - -'j E[G t ] + -E[D g (x t ,x*)] 

\k n J n 

:= ( 1 _ i) Rt+bt - 


We describe the bounds for b t (defined below). 


b t = C - ElHs* - at*|| 2 ] +crrE [||t/|| 2 ] —2cryE [/(s‘) - /(**)] 
Ti " T 2 ^ 

+ (l-l)nG t ] + lnD g ( X \x*)]. 


The terms 1\ and T 2 can be bounded in the following fashion: 


r 1 =E[||s‘-s*|| 2 ]<|E[/(s*)-/(s*)] 

T 2 =E [ll^ll 2 ] < (l + £) E [||V/ it (4) - V/M*)|| 2 ] + (1 + m [II V/i t (®*) - V/i t (x*)|| 2 ] 

< ^ (l + E £ [fM) - /(**) - <V/i(s*), a\ - x*)] 

O T 

+^-(i+p)e x ; [/<(**)-/(**)] 

i 

< 2L ^1 + ^ E [G 4 + £> h (s fc , x*)] + 21/(1 + /3)E[/(x*) - /(s*)]. 

The bound on r I\ is due to strong convexity nature of function /. The first inequality and second 
inequalities on T 2 directly follows from Lemma 3 of [6] and simple application of Lemma 1 respec¬ 
tively. The third inequality follows from the definition of Gt and the fact that a\ = x k for all i (f_ S 
and t £ {km,..., km + m — 1}. 
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Substituting these bounds T\ and T 2 in bt, we get 


b f <- 


2cq — 2 cLt 7 2 (1 + &)——+ 
kX 


E [/(at 4 ) - /(at*)] 


1 


1 


+ - + 2 cL v 2 1 + - - - E[G t ] + -nD g (x\ a;*)] 


/3 / n 


n 


+ 2cLrj 2 (l+^E [£>„(**,®*)] 


< - 


2cr) - 2cLr/ 2 (l + /3) - - - ~ E [/(at*) - /(at*)] 
n ka 


+ Q + 2cL V 2 (l + ^ E[G t ] + 2 cLrj 2 (l + ^ E [D h (x k t at*)] 


< - 


2cr] — 2cLr] 2 (l + (3) — — — —7 
n k X 


E [/(at*) - f(x*)\ + 2cLrj 2 (1 + E [D h {x t fe ,at*)] . 


(A.3) 


The second inequality follows from Lemma 2. In particular, we use the fact that /(at) — f(x*) = 
Df(x,x*) and Df( at, at*) = D g (a t, at*) + D h (x, at*) > D g ( at, at*). The third inequality follows 
from the following for the choice of our parameters: 


1 


1 \ 1 


+ 2 W( 1 + ^< n . 


Applying the recursive relationship on Rt+i for m iterations, we get 


m— 1 / \ m—1 —j 


Rkm+m < ( 1 ~ — ) Rk + ( 1 — “ 

3=0 


h 


m+j 


where 


Rk = E 


c||x fc -at*|| 2 + G fe 


Substituting the bound on b t from Equation (A.3) in the above equation we get the following in¬ 
equality: 

1 


Rkm+m — [ 1 ) Rk 


• ••> J- / i O \ / 1 \ m— 3 

- £ ^2cp(l - Lr,( 1 + /J))---^j E [/(at fern+ -') - /(at*)] 

2Lcrf E [/i(£ fe ) - h{ at*) - (Vh(x*),x k - at*)] . 


J=0 

m—1 / ^ ^ m—l—j 

K 


+ S1 1 -: 

3=0 


We now use the fact that is chosen randomly from {at fem ,..., x km+m x } with probabilities 
proportional to {(1 — l/«;) m_1 ,..., 1} we have the following consequence of the above inequality. 

Rkm+m + K 1 - (i - ^2cr;(l - Lr)(l + Z 3 )) - ^ E [/(x fc+1 ) - /(at*)] 


e [/^-/(*•)]+(i-4) E 


G fc 


+ 2LcrfK 


i-i- 


^ E [D h (x k , at*)] . 


For obtaining the above inequality, we used the strongly convex nature of function /. Again, using 
the Bregman divergence based inequality (see Lemma 2) 

/(at) - f(x*) = D f (x, at*) = D g (x, x*) + D h ( at, x*) > D h (x, at*), 
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we have the following inequality 

R-krn+m "T tv 

1 


1 - (i - ^) ^ 2 c? 7(1 - Lr]{ 1 + P)) ~ ^ E [/(® fc+1 ) - /X*)] 


< 


2c 

X 


1 - 


+ 2 Lcq k 1 + 


P 


1 - 1 - 


E [/(i fc ) - /(a:*)] + 1 - 


E 
(A.4) 


G k 


We use the following notation: 

1 


7 = n 


1 - 1 - 


2cq{\ - i? 7 (l + P)) - 1 - ^ 
n ka 


= max 




2 Lcr l 2 f, , 1 i 

— l 1 + ^ ,K 


1 — I 1 — — 

K 


1 - - 
K 


Using the above notation, we have the following inequality from Equation (A.4). 


E 


f(x k+1 ) - /(**) + -Gk+i 
7 


<6 E 


f(x k ) - f(x*) + -G k 
7 


□ 


u — — 

,.t 


V/ It (x D ^)-V/ It (x fe )+V/(i fc ) 
V = - [V/ it (X) - V/ lt (i fc ) + V/(x fe )] . 


We have the following: 

E||X +1 - a:*|| 2 = E||X + rju 1 - tc*|| 2 = E [||X - x*|| 2 + r? 2 ||X|| 2 + 2r](x t - **,«*)] . (A.5) 
We first bound the last term of the above inequality. We expand the term in the following manner: 

E(X — x*,u t ) = E (x* — x 1 , S7fi t (x D ^)) 


= E 


(x*-x D W,Vf it (x D W)) 


T 3 


+ Y V[(x d -x d+1 yf it (x d ))]+ Y v[(x d -x d+ \Vf it (x D W)-Vf it (x d )) 

d=D(t) d=D(t.) 


Ti 


—V- 

n 


(A. 6 ) 

The first equality directly follows from the definition of u 4 and its property of unbiasedness. The 
second step follows from simple algebraic calculations. Terms T 3 and T 4 can be bounded in the 
following way: 

T 3 <E[.f H (x*) ~ f H (x D ^)]. (A.7) 

This bound directly follows from convexity of function f it . 

t -1 

Ta= Y E [< x d - x d+ \\7f it {x d ))] 
d=D(t ) 
t-1 


< E E 

d=D(t) 


L , 


/ lt (X)-/ lf (x d+1 ) + - ||x d+1 -^|| 2 


< E 


LA 


/i t (aX (t) )-/n(X) +— Y E 


\x d+1 — x d \\ 2 } 


(A. 8 ) 


d=D(t ) 


where 6 < 1 is a constant that depends on the parameters used in the algorithm. 

Proof of Theorem 2 

Proof. Let the present epoch be k + 1. Recall that D(t) denotes the iterate used in the £ th iteration 
of the algorithm. We define the following: 
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The first inequality follows from lipschitz continuous nature of the gradient of function . The 
second inequality follows from the definition of A. The last term T 5 can be bounded in the following 
manner. 


T 5 = E 


< E 


< E 


< E 


t -1 


E (x d -x d+ \Vf it (x D W)-Vf it (x d )) 

d=D(t) 


E \\x d+1 - xd U\Vf it (x D ^)-Vf it (x d ) 


d=D(t) 
t -1 


d—1 


E \\x d+1 - x% t E l|V/i t (® J-+1 )-V/ it (^' 

d=D(t) j=D(t ) 


£-1 d—1 

E E 

d=D(t)j=D(t) 


\\ X d+1 - X d \\l + \W + 1 - X >\\l) 


< LA ( t % E \\ X d+1 -x d f. (A.9) 

d=D{t) 

The first inequality follows from Cauchy-Schwartz inequality. The second inequality follows from 
repeated application of triangle inequality. The third step is a simple application of AM-GM in¬ 
equality and the fact that gradient of the function /,; ( is lipschitz continuous. Finally, the last step 
can be obtained by using a simple counting argument, the fact that the staleness in gradient is at 
most r and the definition of A. 


By combining the bounds on T 3 , T 4 and T 5 in Equations (A. 7), (A. 8 ) and (A.9) respectively and 
substituting the sum in Equation (A. 6 ), we get 


E(x t — x*, u*) < E 


f{x*) — fix*) + —y— E 


\x d+x -x d \\ 2 


d=D(t) 

By substituting the above inequality in Equation (A.5), we get 


(A. 10) 


E [|| 


x t+1 -x*\\ 2 ] < E 


£-1 


\ X f - x*\\ 2 +r/ 2 \\u t \\ 2 -2i 1 (f( X t ) - f(x*)) +LArr] 3 E H m ' 

d=D(t) 


d 11 2 


(A. 11) 


We next bound the term EfU^H 2 ] in terms of E L||v*|| 2 J in the following way: 


E [||w 4 || 2 ] < 2E [||u 4 — tz*|| 2 + ||f t || 2 ] 


< 2E 


l|V/ it ( a : t )-V/ it (* 1, W)|| 5 


2E 


<2 L 2 r E E[!|x d+1 ^a ; d || 2 ] + 2E[|| + | 2 ] 

d=D(t ) 

£-1 

<2L 2 Ai 1 2 t E E[|| + || 2 ] + 2E[||t/|| 2 ] . 

d=D(t ) 

The first step follows from Lemma 3 for r = 2. The third inequality follows from the lipschitz 
continuous nature of the gradient and simple application of Lemma 3. Adding the above inequalities 
from t = km to t = km + m — 1 , we get 


fcm+m-1 


km-\-m— 1 


E ® [Ml 2 ] < E 


t—km 


t—km 


£-1 


2L 2 A?fT E E [||+|| 2 ] +2E[||v t || 2 ] 

d=D(t ) 

km-\-m— 1 fcm+m—1 

<MA„V 2 E[ll«T]+2 2 E[||„T] 


t—km 


t—kn 
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Here we again used a simple counting argument and the fact that the delay in the gradients is at most 
r. From the above inequality, we get 


km-\-m— 1 


km-\-m— 1 


E E Hi"‘ii 2 ] ^ E e oi»'i 2 ]- < A - i2 > 

— t—km 


t—km 


Adding Equation (A.l 1) from t = km to t = km + m — 1 and substituting Equation (A. 12) in the 
resultant, we get 


E [\\x km+m - x*|| 2 ] < E \\x k - x*|| 2 + (ry 2 + LAr 2 g 3 ) ^ 


km-\-m— 1 km-\-m— 1 

||2 


- 2i >(f( xt ) - 


t—km 


< E 


\x k — x*\\ 2 + 2 


1 - 2L 2 Ary 2 r 2 


t—km 
km-\-m —1 


„2 , T A-2^3 \ fcm+m-1 

v n ' E ill'll 2 - E 2 •)(/(*•)-/(*•)) 


t—km 


t—km 


Here, we used the fact that the system is synchronized after every epoch. The first step follows from 
telescopy sum and the definition of x k . From Lemma 3 of [6] (also see [10]), we have 

EIH^II 2 ] < 4LE [/Or 4 ) - /Or*) + f(x k ) - f(x*)\ ■ 

Substituting this in the inequality above, we get the following bound: 

rj 2 + LAr 2 g 3 


2r] — 8L 


<[ X+8L 


1 — 2L 2 A?y 2 r 2 
ry 2 + LAr 2 g 3 
1 - 2L 2 A?y 2 r 2 


?nE[/(x fe+1 ) - f{x*)} 
m) E[/(x fe ) - /(x*)]. 


□ 


Proof of Theorem 3 


Proof. Let the present epoch be k + 1. For simplicity, we assume that the iterates x and A used in the 
each iteration are from the same time step (index) i.e., I)(t) = Off) for all t £ T. Recall that D(t) 
and D'(t) denote the index used in the f th iteration of the algorithm. Our analysis can be extended to 
the case of D(t) f D'(t) in a straightforward manner. We expand function / as /(x) = g(x) + h( x) 
where g(x) = £ E; e s M x ) and H x ) = yy E t^s M x )• We define the following: 


u v = — (ar 1 ^ — X" ) = — 
V 


V/ it (^ (t) ) - V/ i8 (af t} ) + - £ V./)(af (t) ) 

' 7 1 * * 


V/OVV/fO + ^V/.fa') 

z Tj Z-^ 


We use the same Lyapunov function used in Theorem 1. We recall the following definitions: 

G t = ~ E (/<(«<) - M x *) - (V/i(x*),a 4 - x*)) 
n tes 

= E [c||x 4 — x*|| 2 + Gt] . 

Using unbiasedness of the gradient we have E[w 4 ] = — V/(x D l 4 l) and E[x 4 ] = —V/(x 4 ). Using 
this observation, we have the following: 

cE[||x t+1 - x*|| 2 ] = cE[||x 4 + ryu 4 - x*|| 2 ] 

= cE [||x 4 — x*|| 2 ] + eg 2 E [||u 4 || 2 ] +2cryE [(x 4 — x*, u 4 )] . (A.13) 

T b , ' TV 


We bound term Tq in the following manner: 

T 6 = E [||w 4 || 2 ] < 2E [||w 4 - x 4 || 2 ] + 2 E[||x 4 || 2 ]. (A.14) 
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The first term can be bounded in the following manner: 


E[||u*-- 


< E 


< 3 E 


(V/ it (**) - V/ it (^ D W)) - (V/ it (a£ W ) - V/ it «)) 




V/ it (^)-V/ it (at DW ) 


+ 3 E 


v/ it K w )-v/ it (4) 


+ 3 E 


^E( v *(“i)- v *(“? (t) )) 


< 3 E 


V/ it 0rVV/ it (z D «) 


+ 3 E 


+ -E E 

■n ^ 


V/i(a‘) - V/i(af (t) ) 


2 


(A.15) 


The second step follows from Lemma 3 for r = 3. The last step follows from simple application of 
Jensen’s inequality. The first term can be bounded easily in the following manner: 


E 


|V/ it (^)-V/ it (^)|| 2 J <L 2 t £ E 

d=D(t) 

t-1 

< L 2 At] 2 t E 

d=D(t) 


x d+1 - x d \\ 2 it ) 


, d||21 


The second and third terms need more delicate analysis. The key insight for our analysis is that at 
most r ai’s differ from time step D(t) to t. This is due to the fact that the delay is bounded by r and 
at most one an changes at each iteration. Furthermore, whenever there is a change in on, it changes 
to one of the iterates x J for some j = {max{t — r, km},..., t}. With this intuition we bound the 
second term in the following fashion. 


E 


v/wKH-v&K) 


t -1 


£ E e 


j=D(t) ies 


1 (i = ij) V/,(^)-V/,(af (t) ) 


o *_ } 

^ - E E e 

j=D(t ) ies 


l(i = *,-) ( II V/i(aJ) - V/i(^)|r + V/ 4 (af (t) ) - 


2 t_1 

ph{x j ) - V/iOr*)!! 2 ' 

+ h E £ E ' 

V/i(a? W ) - V/iOr*) 2 

j=D(t) ies 

t —i r 


j—D(t) tes L 



4 L v- ^ 
< — > E 

-n ^ 


J=D(t) 


-E/^) _ fi{x*),x° - X*)) 

n z —* 


ies 


4 Lr„ 


-E 


- E - /i(*‘) - (V/iOO, af (t) - x*) 


ies 


The first inequality follows from the fact that if o: ( Di: /1 and o:* ( differ, then (a) it should have been 
chosen in one of the iteration j £ {D(t ),..., t — 1} and (b) a it is changed to .t- 7 in that iteration. 
The second inequality follows from Lemma 3 for r = 2. The third inequality follows from the fact 
that the probability P(ij = i) = 1 /n. The last step directly follows from Lemma 1. Note that sum 
is over indices in S since a^’s for i ^ S do not change during the epoch. 

The third term in Equation (A. 15) can be bounded by exactly the same technique we used for the 
second term. The bound, in fact, turns out to identical to second term since i t is chosen uniformly 
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random. Combining all the terms we have 

t—i art i 

T 6 <2E[\\v t \\ 2 ] + 6L 2 Ar 1 2 T Y E [|| U d || 2 ] + — Y E [D g {x\x*)] 

d=D(t) n j=D(t) 


48 Lt 
n 


E [G D{t) ]. 


The term TV can be bounded in a manner similar to one in Theorem 2 to obtain the following (see 
proof of Theorem 2 for details): 


E(x* -x*,u*) < E 


f(x*) ~ /(**) + 


g ||„T 

d=D(t) 


(A.16) 


We need the following bound for our analysis: 


m— 1 


EH 

0 x 


^ \ m— 1—j m—1 


E[||t/' m+i || 2 ] <2 V (l-- 
i=o V * 


Y \ m ~ l ~j 


E[||i> 


,km+j ||2i 


fcm+m-1 t—1 

yg 6L 2 At] 2 t ^ Elllu 

d=D(t) 


d 11 21 


t—km 


km-\-m— 1 . 0 ^ £—1 

+ £ ^ £ e 

t—km j—D(t) 

km-\-m —1 , 0 r 

+ E ^ ■ 


t—km 

The above inequality follows directly from the bound on 7); by adding over all t in the epoch. Under 
the condition 

/ \ m— 1 1 


we 


have the following inequality 


v 2 < 1 - - 


1 

12L 2 Ar 2 ’ 


m— 1 


EH 

j=o v 


m-l—j 


m—l—j m— 1 , 

E[||w fcm+J || 2 ] <4V ( 1 - - 
j=o ' * 

fcm+m-1 r\ r* T t—1 

96L 


E[||« 


,fcm+j 11 2 l 


E ^ E 


The above inequality follows from the fact that 


n 

t—km j—D(t) 

km-\-m— 1 T 

- E 

t—km 


(A.17) 


km-\-m— 1 


t— 1 


fcm+ra—1 


y] 6L 2 AifT Y E [||u d || 2 ] < Y 6L 2 Ar) 2 T 2 E 
d=D(t) 


U 


1 1121 


t—km 


t=km 
m— 1 


1^7 1 


i=o 


m-l-j 


E[||u 


km+j 11 2 1 


The above relationship is due to the condition on 77 and the fact that any d £ {D{t ),.... f - 1 } for 
at most t values of t. We have the following: 

Rt +1 = cE [H®* - x*|| 2 ] + crgE [||t/|| 2 ] + 2 C 77 E [(x* - x*,u*)] + E [G t +t] 


1 


— (1-) Rt + £t- 


(A.18) 
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We bound et in the following manner: 

e t = -\\x t -x*\\ 2 +(---)E[G t }+cr 1 2 E [||«*|| 2 ] + 2crjE [(x* - a:*, u*)] + E[G t+1 ] 

= - a;*|| 2 + f- — E[G t ] + cr] 2 E [||u t || 2 l + 2c?yE \(x* - x* ,u*)] + -E[D„(x t , x*)] 

k \k n) n 

< - { 2 ^ - E [/(x*) - fix*)] + Q - i) E[G t ] + c V 2 E[\\u*\\ 2 ] 

*~ 1 1 

+ cLArr] 3 ^ E [||tt d || 2 ] + -E[D g ix*,x*)]. 

d—D(t) 

The second equality follows from the definition of G*+i (see Equation (A. 2)). 

E[G t+1 ] = (l- ^ E[G t ] + \nD a ix*,x*)\. 


Applying the recurrence relationship in Equation (A. 18) with the derived bound on e*, we have 

m— 1 / \ m—l—j 


R. 


km-\-m — 


<li-- 

K 




&km-\-j 


j =0 


m—l—j 


^ km+j 5 


where e' t is defined as follows 


R k =E 


c\\x K - x*\\ 2 + G k 


e ; = _ ( 2cw _ |£) E [/<„,«) _ /(»-)] + (i - i) eigu 


CTj 


1 - - 


cLAt 2 tj 5 E[||u t || 2 ] + -E [Dgix* ,x*)\. 


The last inequality follows from that fact that the delay is at most r. In particular, each index 
j £ {Dit),... ,t— 1} occurs at most r times. We use the following notation for ease of exposition: 


C = I cr r + (1 - 


1 


K 


cLAt z rj 


2 3 


Substituting the bound in Equation (A. 17), we get the following: 

/ / 2c \ rn ~ 1 f 1 \ m ~ i — 0 

Rkvn+m < - -J Rk~ ^2c?7 - — J ^ ( 1 ~ ~ 


j=0 


E [fix km+ i) - /(**)] 


m-i , 
3=0 V 

96C Lt , 


m—l—j 


E[||t) fcm+ ' J || 2 ] 


+ 


1 96 (Lt 

k n 


1 - - 
K 



m—1—j 


E [j D g ix km+ Gx *)] 

m—l—j 

E[G D{km+j) ]. (A.19) 


We now use the following previously used bound on v* (see bound T 2 in the proof of Theorem 1): 

E[||^|| 2 ] < 2L (l + [G t + D h ix k ,x*)\ + 2L(1 + /3)E [fix*) - /(**)]. 
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Substituting the above bound on v t in Equation (A. 19), we get the following: 


Rkm-\-m 5 : ( 1 ^ ) Rk 




l 

l — 

K 


+ 


3=0 


— + 8 (X ( 1 + — ) + 

m— 1 


m—l—j 


1 \ 96Cir 


E [/(x*”«) - /(x*)] 


n 

m-l-j 


1 - - 
K 




3= 0 


E [Gjfcm+j] 

m-l-j 


E[D h (x k ,x*)\ 


< 


2c 


1 - 


1 


E [/(i fc ) - /(**)] + 1 - 


1 


E 


G fc 


2c,-8CL(i + ffl-E_55i£: 

kA n 


1 - - 
Av 


m—1 




J=0 


m—l—j 


( 1\ 

r ( i\ m i 

( 1 + — ) K 


V PJ 

L v J 


E [f(x km+j ) — f(x*)\ 
E [D h (x k ,x*j\ . 


(A.20) 


The first step is due to the Bregman divergence based inequality Df(x,x*) > I) g (x. x*). The 
second step follows from the expanding R k and using the strong convexity of function /. For 
brevity, we use the following notation: 


7 a = « 


1 - 


9„ = max 


1 

1- 

K 


2 c 

la A 


1 - 


8 CL (l + i) 


2c 

96C Lt 

k\ 

n 

L 

1 - 

(i-- 






We now use the fact that x k+1 is chosen randomly from {x km ,... jX km+m ~ 1 } with probabili¬ 
ties proportional to {(1 — l/«:) m_1 ,..., 1}. Hence, we have the following inequality from Equa¬ 
tion (A.20): 


E 

f(x k+1 ) - f{x*) + —Gk+i 

< 0 a E 

f{x k ) - fix*) + -G k 


V la \ 


V la J 


where 9 a < 1 is a constant that depends on the parameters used in the algorithm. □ 
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Remarks about the parameters in Theorem 1 & Theorem 3 


In this section, we briefly remark about the parameters in Theorems 1 & 3. For Theorem 1, suppose 
we use the following instantiation of the parameters: 


Then we have. 


9 

K 


P 

C 


1 

16(An + L ) 

4 ( 

— = 64 ( n 

A?? V 

2An + L 


L 


— = 32 | A 

ijn 



9 = max 


2(i - 4) m | 1 

3(l-(l4rt + 3(l + f) 



In the interesting case of L/X = n (high condition number regime), since k = 0(n), one can obtain 
a constant 9 (say 9 = 0.5) with m = 0(n). This leads to e accuracy in the objective function after 
0(log(l/e)) epochs of Hsag. When m = 0(n ), the computational complexity of each epoch of 
Hsag is 0(n). Hence, the total computational complexity of Hsag is 0[n log(l/e)). On the other 
hand, because L/X = n, batch gradient descent method requires 0(n log(l/e)) iterations to achieve 
e accuracy in the objective value. Since the complexity of each iteration of gradient descent is 
0(n) (as it passes through the whole dataset for calculating the gradient), the overall computational 
complexity of batch gradient descent is 0(n 2 log(l/e)). In general, for high condition number 
regimes (which is typically the case in machine learning applications), Hsag (like Svrg, Saga) 
will be significantly faster than the batch gradient methods. Furthermore, the convergence rate is 
strictly better than the sublinear rate obtained for Sgd. 


The parameter instantiations for Theorem 3 are much more involved. Suppose A 1 / 2 r < 1 (this is 
the sparse regime that is typically of interest to the machine learning community) and m > n > 9r. 
The other case ( A 1 2 t > 1) can be analyzed in a similar fashion. We set the following parameters: 


Then we have the following: 

c, (1 - 


9 = 


64 (An + L) 

4 256 (n + 

2An + L 


P = 


c = — = 32 
r\n 


A + — 
n 


32n(Xn + L) 


1 + 


L 


64(A n - 


9„ < max • 


6(1 - k J 


+ 


7(l-(l-4) m ) 448(1+4r) 


L), 

195(1 

2 An 'i 
L 


1 - 


Again, in the case of L/X = n, we can obtain constant 9 a (say 9 a = 0.5) with m = O(n) and 
k = 0(n). The constants in the parameters are not optimized and can be improved by a more 
careful analysis. Furthermore, sharper constants can be obtained in specific cases. For example, 
see [10] and Theorem 2 for synchronous and asynchronous convergence rates of SVRG respectively. 
Similarly, sharper constants for Saga can also be derived by simple modifications of the analysis. 
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Other Lemmatta 


Lemma 1. [10] For any ai £ where i £ [n] and x* , we have 

O T 

E [||V/ it (cO - V/ it (^)|| 2 ] < — E [/*(“<) ~ /(**) - < V /i(**), ^ **)] • 

i 

Lemma 2. Suppose f : R d —> K and f = g + h where f, g and h are convex and differentiable, x* 
is the optimal solution to arg min^ f(x) then we have the following 

Df(x, x*) = f(x) - /(:r*) 

Df(x, x*) = D g (x,x*) + D h (x, x*) 

Df(x,x*) > D g {x,x*). 

Proof. The proof follows trivially from the fact that x* is the optimal solution and linearity and 
non-negative properties of Bregman divergence. □ 

Lemma 3. For random variables z \,..., z r , we have 

E [||~i + + z r \\ 2 \ < rE [||zi|| 2 + ... + 11-Zr-11 2 ] • 
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